How do I properly size CPU, memory, and GPU for my AI workloads?
Objective
The goal of this document is to help you choose appropriate CPU, memory, and GPU resources for running common AI workloads.
Finding the right amount of CPU, memory, and GPU for a given AI workload can be challenging.
Precision, batch size, model size, and context length are all tightly coupled to how many resources (especially GPU memory) a workload needs.
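As a rough starting point, GPU memory can be estimated from the parameter count and numeric precision. The sketch below is a rule of thumb only, not an exact calculation: the `fits` helper and the 10% headroom figure are assumptions, and real usage grows with batch size and context length (KV cache).

```python
# Rule-of-thumb VRAM estimate: weights = parameter count x bytes per parameter,
# plus headroom for the KV cache, activations, and framework buffers.
# The 10% headroom is an assumption; large batches or long contexts need more.

def weights_gb(params_billion: float, bytes_per_param: float) -> float:
    """GB needed just to hold the model weights (1B params at 2 bytes ~= 2 GB)."""
    return params_billion * bytes_per_param

def fits(params_billion: float, bytes_per_param: float,
         vram_per_gpu_gb: float, num_gpus: int, headroom: float = 0.10) -> bool:
    """True if the weights plus headroom fit in the combined VRAM of the GPUs."""
    needed = weights_gb(params_billion, bytes_per_param) * (1 + headroom)
    return needed <= vram_per_gpu_gb * num_gpus

# Llama-7B in FP16: ~14 GB of weights -> a single 16 GB+ GPU is enough
print(fits(7, 2.0, vram_per_gpu_gb=16, num_gpus=1))   # True

# Llama-70B in FP16: ~140 GB of weights -> needs 2x H100 (80 GB each)
print(fits(70, 2.0, vram_per_gpu_gb=80, num_gpus=1))  # False
print(fits(70, 2.0, vram_per_gpu_gb=80, num_gpus=2))  # True

# Llama-70B quantized to 8-bit: ~70 GB -> fits 1x H100 80 GB at small batch sizes
print(fits(70, 1.0, vram_per_gpu_gb=80, num_gpus=1))  # True
```

These estimates line up with the recommended values in the table below; treat them as a sanity check, not a guarantee.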
Recommended Values:
AI Workload | CPU Cores | Memory (GB) | GPU Count | VRAM per GPU (GB) | Notes |
---|---|---|---|---|---|
Llama-7B | 8–16 | 32–64 | 1 | 16+ | Single GPU sufficient; fits L40, L40S, H100, H200 |
Llama-70B (FP16) | 16–32 | 128–256 | 2 (H100) | 80 | Or 1× H200 (141 GB), 3× L40 (48 GB each) |
Llama-70B (Quantized 8-bit) | 16–32 | 128–256 | 1 (H100) | 80 | Or 2× L40 (48 GB each), depends on batch size |
vLLM – Inference Server | 16–32 | 64–128 | Model-dependent | Model-dependent | See model requirements; e.g., 2× H100 for 70B (see the sketch after this table) |
NVIDIA NIM | 16–32 | 64–128 | Model-dependent | Model-dependent | See model requirements; e.g., 2× H100 for 70B |
Infinity Server (Embeddings) | 8–16 | 32–64 | 1 | 8–16 | Fits L40, L40S, H100, H200; often overprovisioned |
Invoke (Image Generation) | 8–16 | 32–64 | 1 | 8+ | Preferably 16 GB; fits all specified GPUs |
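To make the vLLM row concrete, here is a minimal sketch of how the GPU count maps to a tensor-parallel deployment. The model name, memory-utilization value, and prompt are placeholders; adjust them to your environment.

```python
from vllm import LLM, SamplingParams

# Sketch only: serving a 70B model across 2 GPUs (matching the 2x H100 row above).
# The model ID and gpu_memory_utilization value are example placeholders.
llm = LLM(
    model="meta-llama/Llama-2-70b-chat-hf",  # placeholder model ID
    tensor_parallel_size=2,                  # one weight shard per GPU -> 2x H100
    dtype="float16",                         # FP16 weights, as in the table
    gpu_memory_utilization=0.90,             # leave ~10% VRAM headroom
)

outputs = llm.generate(["Summarize why GPU memory sizing matters."],
                       SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```

For the quantized 70B row, a single-GPU deployment would drop `tensor_parallel_size` to 1 and load an appropriately quantized checkpoint.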